Real-time interaction has long been a significant challenge for developers and researchers applying artificial intelligence. Integrating multimodal information (such as text, images, and audio) into a coherent dialogue system is particularly complex. Although advanced large language models like GPT-4 have made progress, many AI systems still struggle with fluency in real-time conversation, context awareness, and multimodal understanding, which limits their effectiveness in practical applications.